3 Importing Data – Data Science with R

3.1 Introduction

In this chapter, we discuss the main topics regarding data imports. Understanding how to effectively import data is a crucial skill for data analysis, as it forms the foundation upon which all subsequent analysis is built. As such, the principles we will cover are broadly applicable across various software environments.

3.2 Spreadsheets and File Types

Data sets are stored in many formats. Probably the most common is the table form or electronic spreadsheet (e.g., Excel format). A spreadsheet is similar to a data frame or matrix, because it consists of rows and columns. The type of file determines how we import it into R. Common file types include Excel workbooks, CSV files, and text files with specific delimiters such as tabs or semicolons. A CSV (Comma-Separated Values) file, for example, uses commas to separate the values within each row. Knowing the file type matters because it determines which function or package we use to read the data into R. For instance, the file customer_churn below is seen with a text editor:

The first row contains headers, which might appear wrapped due to length but, in terms of structure, they are still a single row. By understanding the file type and structure, we can accurately import our data in R.
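As a minimal sketch of how the delimiter influences the choice of import function, the following writes the same three rows to temporary files with three different delimiters and reads each one back with a matching base R function. The toy values and the temporary files are assumptions made for illustration; they are not the customer_churn data.

```r
# Toy data for illustration only (not the customer_churn file)
toy <- data.frame(ID = 1:3, Recency = c(46L, 40L, 35L))

# Temporary files, one per delimiter
comma_file     <- tempfile(fileext = ".csv")
semicolon_file <- tempfile(fileext = ".csv")
tab_file       <- tempfile(fileext = ".txt")

write.table(toy, comma_file,     sep = ",",  row.names = FALSE, quote = FALSE)
write.table(toy, semicolon_file, sep = ";",  row.names = FALSE, quote = FALSE)
write.table(toy, tab_file,       sep = "\t", row.names = FALSE, quote = FALSE)

# Each delimiter has a matching reader in base R
from_comma     <- read.csv(comma_file)       # comma-separated values
from_semicolon <- read.csv2(semicolon_file)  # semicolon-separated, decimal comma
from_tab       <- read.delim(tab_file)       # tab-separated values

# Look at the raw text, much as a text editor would show it
readLines(semicolon_file)
```

All three imports should yield the same data frame; only the function that matches the delimiter changes.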

3.3 Paths and the working directory

Besides the file type, we need to know the path of a file. The path denotes where the file is stored. Files are usually kept in organized folders, which are called directories. Although the terms may not be intuitive, the important thing to remember is that, to import a file in RStudio, we need to know its type and where it is. To understand the terminology, suppose we have a csv file called customer_churn in a folder called Data Sets. A possible path in that case would be C:/Users/User/Desktop/Data Sets/customer_churn.csv. Let’s break it down:

Full path: C:/Users/User/Desktop/Data Sets/customer_churn.csv
Directory Path: C:/Users/User/Desktop/Data Sets
Directory: Data Sets
File: customer_churn.csv
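The breakdown above can be reproduced in R itself. The sketch below uses base R helpers on the example path; it only manipulates the text of the path, so the file does not need to exist on the machine.

```r
full_path <- "C:/Users/User/Desktop/Data Sets/customer_churn.csv"

directory_path <- dirname(full_path)            # everything before the file name
directory      <- basename(directory_path)      # last folder in the path
file_name      <- basename(full_path)           # the file itself
file_type      <- tools::file_ext(full_path)    # extension, a hint at the file type

# file.path() joins the pieces back together with forward slashes
rebuilt_path <- file.path(directory_path, file_name)
identical(rebuilt_path, full_path)
```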

So, by using the full path, we can import a data set in R. To see how this works in practice, we implement what we just described. The built-in function to import a csv file in R is the read.csv() function. Inside the function we specify the full path and we can store the data in an object directly. In our example, we give the name churn_data to the object in which we want to store the imported data.

# Import the data 
churn_data <- read.csv("C:/Users/User/Desktop/Data Sets/customer_churn.csv")  

# Print the first 6 rows 
head(churn_data)

  ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
1  1      46           Low        26             Low        3009.60
2  2      40           Low        56          Medium       57347.28
3  3      35           Low       293            High       14496.16
4  4      50           Low        18             Low        1416.20
5  5      77        Medium        14             Low         523.72
6  6      55        Medium        39             Low        8830.32
  Monetary_Value_Level Observation_Period Churn
1               Medium                742     0
2                 High               2301     0
3                 High               2411     0
4               Medium                813     0
5                  Low                  1     0
6               Medium               2077     0

When we work with R, though, we are always located “somewhere” on the computer. In other words, R assumes a specific path from which we work. This is called our working directory. When a file is in the working directory, there is no need to specify its full path every time we import it; we can simply use the file name inside the function. Before we see how this works, let’s check our current working directory. For this, we can use the function getwd():

# Get Working Directory 
getwd()

[1] "C:/Users/User/Document"

We see that our working directory is C:/Users/User/Document. To change the working directory, we can use the function setwd(). For instance, suppose we want to change the working directory from C:/Users/User/Document to C:/Users/User/Desktop/Data Sets. To do this, we enter the desired directory path inside the parentheses.

# Change Working Directory 
setwd("C:/Users/User/Desktop/Data Sets")

Now, if we call getwd() again, we see that our working directory is different.

# Get Working Directory 
getwd()

[1] "C:/Users/User/Desktop/Data Sets"

As our working directory is the Data Sets directory, we can now use the read.csv() function by giving only the name of the file inside the parentheses:

# Import the data
churn_data <- read.csv("customer_churn.csv") 

# Print the first 6 rows 
head(churn_data)

  ID Recency Recency_Level Frequency Frequency_Level Monetary_Value
1  1      46           Low        26             Low        3009.60
2  2      40           Low        56          Medium       57347.28
3  3      35           Low       293            High       14496.16
4  4      50           Low        18             Low        1416.20
5  5      77        Medium        14             Low         523.72
6  6      55        Medium        39             Low        8830.32
  Monetary_Value_Level Observation_Period Churn
1               Medium                742     0
2                 High               2301     0
3                 High               2411     0
4               Medium                813     0
5                  Low                  1     0
6               Medium               2077     0

We see that the import succeeds. In this way, we can import different data sets quite efficiently. Another advantage is that, when we share our R script, our code is more readable and other people can easily run the script, provided that their working directory contains the same data set. Note that when a file is located in the working directory, we can still use the full path if we want; the result would be exactly the same.
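The following is a small, self-contained sketch of the same idea that can be run on any machine. It assumes a temporary folder (here called wd_demo, a hypothetical name) as a stand-in for the Data Sets directory and a toy file in place of the real data, and it restores the original working directory at the end.

```r
# Hypothetical stand-in for the Data Sets folder
demo_dir <- file.path(tempdir(), "wd_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Toy file for illustration (not the real customer_churn data)
write.csv(data.frame(ID = 1:3, Churn = c(0, 0, 1)),
          file.path(demo_dir, "toy_churn.csv"), row.names = FALSE)

# setwd() returns the previous working directory, so we can restore it later
old_wd <- setwd(demo_dir)

# A bare file name is resolved against the working directory
file_found <- file.exists("toy_churn.csv")
by_name    <- read.csv("toy_churn.csv")

# The full path still works and gives the same result
by_path     <- read.csv(file.path(demo_dir, "toy_churn.csv"))
same_result <- identical(by_name, by_path)

# Restore the original working directory
setwd(old_wd)
```

Saving and restoring the previous working directory is a design choice of this sketch; it keeps the example from changing the state of the R session after it has run.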

Slash vs Backslash

The difference between a slash (‘/’) and a backslash (‘\’) in the context of a working directory primarily relates to their usage in different operating systems and how they denote paths in a file system:

Slash (/) is commonly used in Unix-like operating systems (Linux, macOS) and URLs.
Backslash (\) is primarily used in Windows operating systems.

However, in R, as in many other programming languages, the backslash (\) is used as an escape character. This means that when R sees a backslash, it expects it to be followed by another character or sequence that represents a special character or command (e.g., \n for newline).

Example:

Incorrect (using a single backslash): "C:\Users\YourName\Documents"
Correct (using double backslashes): "C:\\Users\\YourName\\Documents"
Preferred (using forward slashes): "C:/Users/YourName/Documents"

For ease of use and to avoid errors with escape characters, using forward slashes (/) in file paths is usually the best option when working in R, irrespective of platform.
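A short sketch of how these forms behave in practice is given below. It assumes R 4.0.0 or later for the raw string syntax r"(...)"; the path itself is the illustrative one from the example and does not need to exist.

```r
# Each \\ in the source code is stored as a single backslash in the string
escaped <- "C:\\Users\\YourName\\Documents"
forward <- "C:/Users/YourName/Documents"

cat(escaped, "\n")                     # shows the string as it is actually stored
backslash_length <- nchar("\\")        # the escaped backslash is one character

# Convert a Windows-style path to forward slashes
converted <- gsub("\\", "/", escaped, fixed = TRUE)
identical(converted, forward)

# Raw strings (assumption: R >= 4.0.0) allow a pasted Windows path without doubling
raw_path <- r"(C:\Users\YourName\Documents)"
identical(raw_path, escaped)
```

The single-backslash form is not included in the code because R would try to interpret sequences such as \U as escape sequences and would typically refuse to parse the string.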

3.4 Importing Data in RStudio

Now that we understand “path” and “directory”, we can examine in practice how to import a data set in R by using the RStudio functionality. To keep things simple, we import the same data set with the same full path as described. So, the csv file that we import is called customer_churn and the directory path is C:/Users/User/Desktop/Data Sets. As shown earlier, we can use the read.csv() function to import this data set.

However, RStudio also provides a user-friendly functionality that can help us import our data sets in a relatively straightforward manner. By choosing File -> Import Dataset, we can see that RStudio provides us with different options, such as “From Text (readr)…”. By clicking this option, we will see the following output:

To find the file in our computer, we click the option Browse on the top right corner and find the file by browsing in our computer system. When we find the file we want, we click on it and visualize it on the emerging table:

Generally, we can see a number of options to change how RStudio imports the file. In this way, we see exactly what RStudio will import and we can make adjustments before the import takes place. Lastly, we see the exact R code that performs the import in the bottom right corner. This is very valuable because not only can we use this code later, but we can also learn how to import a data set by coding directly in the console. In this example we note that RStudio used the readr package and the read_csv() function to make this import possible. This function can be thought of as an advanced version of the read.csv() function. The details are not important at this point; our goal is to capture the intuition about paths, directories and the overall functionality of RStudio regarding data import.
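For comparison, the sketch below imports the same toy file once with base R and once with readr. It is not the code that RStudio generates; the toy file is an assumption for illustration, and the readr part runs only if the readr package is installed.

```r
# Toy file for illustration (not the customer_churn data)
toy_file <- tempfile(fileext = ".csv")
write.csv(data.frame(ID = 1:3, Level = c("Low", "Medium", "High")),
          toy_file, row.names = FALSE)

# Base R: returns a data.frame
base_import <- read.csv(toy_file)
class(base_import)

# readr is an add-on package; install.packages("readr") if it is missing
if (requireNamespace("readr", quietly = TRUE)) {
  readr_import <- readr::read_csv(toy_file)   # returns a tibble
  class(readr_import)
}
```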

3.5 Importing Data from other sources

It is possible to import data in R from various sources, including relational database platforms such as MySQL, as well as directly from web pages via URLs. Additionally, R can be used for web scraping, which involves extracting data from HTML or directly from web pages. Given the variety of data sources, it’s impractical to cover every possible method in detail. However, the core idea remains the same: we need to guide R to the location of the data and specify the appropriate function for importing it, as different file types require different functions.
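To make the core idea concrete for local files, here is a minimal sketch of a helper that picks an import function based on the file extension. The function import_by_extension() is hypothetical (it is not part of R or of any package), and the set of extensions it handles is an arbitrary choice for illustration.

```r
# Hypothetical helper: choose a base R import function from the file extension
import_by_extension <- function(path) {
  extension <- tolower(tools::file_ext(path))
  switch(extension,
    csv = read.csv(path),     # comma-separated values
    tsv = ,                   # falls through to the next entry
    txt = read.delim(path),   # tab-separated text
    rds = readRDS(path),      # a single R object saved with saveRDS()
    stop("No import function registered for extension: ", extension)
  )
}

# Usage with a toy tab-separated file (values made up for illustration)
toy_file <- tempfile(fileext = ".tsv")
write.table(data.frame(ID = 1:2, Churn = c(0, 1)),
            toy_file, sep = "\t", row.names = FALSE)
imported <- import_by_extension(toy_file)
```

Sources such as databases or web pages need their own packages and functions, but the pattern is the same: a location plus a function suited to the format.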